Goto

Collaborating Authors

 Blue Earth County


Copula Based Fusion of Clinical and Genomic Machine Learning Risk Scores for Breast Cancer Risk Stratification

Aich, Agnideep, Hewage, Sameera, Murshed, Md Monzur

arXiv.org Machine Learning

Clinical and genomic models are both used to predict breast cancer outcomes, but they are often combined using simple linear rules that do not account for how their risk scores relate, especially at the extremes. Using the METABRIC breast cancer cohort, we studied whether directly modeling the joint relationship between clinical and genomic machine learning risk scores could improve risk stratification for 5-year cancer-specific mortality. We created a binary 5-year cancer-death outcome and defined two sets of predictors: a clinical set (demographic, tumor, and treatment variables) and a genomic set (gene-expression $z$-scores). We trained several supervised classifiers, such as Random Forest and XGBoost, and used 5-fold cross-validated predicted probabilities as unbiased risk scores. These scores were converted to pseudo-observations on $(0,1)^2$ to fit Gaussian, Clayton, and Gumbel copulas. Clinical models showed good discrimination (AUC 0.783), while genomic models had moderate performance (AUC 0.681). The joint distribution was best captured by a Gaussian copula (bootstrap $p=0.997$), which suggests a symmetric, moderately strong positive relationship. When we grouped patients based on this relationship, Kaplan-Meier curves showed clear differences: patients who were high-risk in both clinical and genomic scores had much poorer survival than those high-risk in only one set. These results show that copula-based fusion works in real-world cohorts and that considering dependencies between scores can better identify patient subgroups with the worst prognosis.




Bag of Coins: A Statistical Probe into Neural Confidence Structures

Aich, Agnideep, Aich, Ashit Baran, Murshed, Md Monzur, Hewage, Sameera, Wade, Bruce

arXiv.org Machine Learning

Modern neural networks, despite their high accuracy, often produce poorly calibrated confidence scores, limiting their reliability in high-stakes applications. Existing calibration methods typically post-process model outputs without interrogating the internal consistency of the predictions themselves. In this work, we introduce a novel, non-parametric statistical probe, the Bag-of-Coins (BoC) test, that examines the internal consistency of a classifier's logits. The BoC test reframes confidence estimation as a frequentist hypothesis test: does the model's top-ranked class win 1-v-1 contests against random competitors at a rate consistent with its own stated softmax probability? When applied to modern deep learning architectures, this simple probe reveals a fundamental dichotomy. On Vision Transformers (ViTs), the BoC output serves as a state-of-the-art confidence score, achieving near-perfect calibration with an ECE of 0.0212, an 88% improvement over a temperature-scaled baseline. Conversely, on Convolutional Neural Networks (CNNs) like ResNet, the probe reveals a deep inconsistency between the model's predictions and its internal logit structure, a property missed by traditional metrics. We posit that BoC is not merely a calibration method, but a new diagnostic tool for understanding and exposing the differing ways that popular architectures represent uncertainty.


A Machine Learning Framework for Breast Cancer Treatment Classification Using a Novel Dataset

Hasan, Md Nahid, Murshed, Md Monzur, Hasan, Md Mahadi, Chowdhury, Faysal A.

arXiv.org Artificial Intelligence

Breast cancer (BC) remains a significant global health challenge, with personalized treatment selection complicated by the disease's molecular and clinical heterogeneity. BC treatment decisions rely on various patient-specific clinical factors, and machine learning (ML) offers a powerful approach to predicting treatment outcomes. This study utilizes The Cancer Genome Atlas (TCGA) breast cancer clinical dataset to develop ML models for predicting the likelihood of undergoing chemotherapy or hormonal therapy. The models are trained using five-fold cross-validation and evaluated through performance metrics, including accuracy, precision, recall, specificity, sensitivity, F1-score, and area under the receiver operating characteristic curve (AUROC). Model uncertainty is assessed using bootstrap techniques, while SHAP values enhance interpretability by identifying key predictors. Among the tested models, the Gradient Boosting Machine (GBM) achieves the highest stable performance (accuracy = 0.7718, AUROC = 0.8252), followed by Extreme Gradient Boosting (XGBoost) (accuracy = 0.7557, AUROC = 0.8044) and Adaptive Boosting (AdaBoost) (accuracy = 0.7552, AUROC = 0.8016). These findings underscore the potential of ML in supporting personalized breast cancer treatment decisions through data-driven insights.


CopulaSMOTE: A Copula-Based Oversampling Approach for Imbalanced Classification in Diabetes Prediction

Aich, Agnideep, Murshed, Md Monzur, Hewage, Sameera, Mayeaux, Amanda

arXiv.org Machine Learning

Diabetes mellitus poses a significant health risk, as nearly 1 in 9 people are affected by it. Early detection can significantly lower this risk. Despite significant advancements in machine learning for identifying diabetic cases, results can still be influenced by the imbalanced nature of the data. To address this challenge, our study considered copula-based data augmentation, which preserves the dependency structure when generating data for the minority class and integrates it with machine learning (ML) techniques. We selected the Pima Indian dataset and generated data using A2 copula, then applied four machine learning algorithms: logistic regression, random forest, gradient boosting, and extreme gradient boosting. Our findings indicate that XGBoost combined with A2 copula oversampling achieved the best performance improving accuracy by 4.6%, precision by 15.6%, recall by 20.4%, F1-score by 18.2% and AUC by 25.5% compared to the standard SMOTE method. Furthermore, we statistically validated our results using the McNemar test. This research represents the first known use of A2 copulas for data augmentation and serves as an alternative to the SMOTE technique, highlighting the efficacy of copulas as a statistical method in machine learning applications.


A2 Copula-Driven Spatial Bayesian Neural Network For Modeling Non-Gaussian Dependence: A Simulation Study

Aich, Agnideep, Hewage, Sameera, Murshed, Md Monzur, Aich, Ashit Baran, Mayeaux, Amanda, Dey, Asim K., Das, Kumer P., Wade, Bruce

arXiv.org Machine Learning

In this paper, we introduce the A2 Copula Spatial Bayesian Neural Network (A2-SBNN), a predictive spatial model designed to map coordinates to continuous fields while capturing both typical spatial patterns and extreme dependencies. By embedding the dual-tail novel Archimedean copula viz. A2 directly into the network's weight initialization, A2-SBNN naturally models complex spatial relationships, including rare co-movements in the data. The model is trained through a calibration-driven process combining Wasserstein loss, moment matching, and correlation penalties to refine predictions and manage uncertainty. Simulation results show that A2-SBNN consistently delivers high accuracy across a wide range of dependency strengths, offering a new, effective solution for spatial data modeling beyond traditional Gaussian-based approaches.


Can Copulas Be Used for Feature Selection? A Machine Learning Study on Diabetes Risk Prediction

Aich, Agnideep, Murshed, Md Monzur, Mayeaux, Amanda, Hewage, Sameera

arXiv.org Machine Learning

Accurate diabetes risk prediction relies on identifying key features from complex health datasets, but conventional methods like mutual information (MI) filters and genetic algorithms (GAs) often overlook extreme dependencies critical for high-risk subpopulations. In this study we introduce a feature-selection framework using the upper-tail dependence coefficient (λU) of the novel A2 copula, which quantifies how often extreme higher values of a predictor co-occur with diabetes diagnoses (target variable). Applied to the CDC Diabetes Health Indicators dataset (n=253,680), our method prioritizes five predictors (self-reported general health, high blood pressure, body mass index, mobility limitations, and high cholesterol levels) based on upper tail dependencies. These features match or outperform MI and GA selected subsets across four classifiers (Random Forest, XGBoost, Logistic Regression, Gradient Boosting), achieving accuracy up to 86.5% (XGBoost) and AUC up to 0.806 (Gradient Boosting), rivaling the full 21-feature model. Permutation importance confirms clinical relevance, with BMI and general health driving accuracy. To our knowledge, this is the first work to apply a copula's upper-tail dependence for supervised feature selection, bridging extreme-value theory and machine learning to deliver a practical toolkit for diabetes prevention.


Fine-Tuning Federated Learning-Based Intrusion Detection Systems for Transportation IoT

Akinie, Robert, Gyimah, Nana Kankam Brym, Bhavsar, Mansi, Kelly, John

arXiv.org Artificial Intelligence

The rapid advancement of machine learning (ML) and on-device computing has revolutionized various industries, including transportation, through the development of Connected and Autonomous Vehicles (CAVs) and Intelligent Transportation Systems (ITS). These technologies improve traffic management and vehicle safety, but also introduce significant security and privacy concerns, such as cyberattacks and data breaches. Traditional Intrusion Detection Systems (IDS) are increasingly inadequate in detecting modern threats, leading to the adoption of ML-based IDS solutions. Federated Learning (FL) has emerged as a promising method for enabling the decentralized training of IDS models on distributed edge devices without sharing sensitive data. However, deploying FL-based IDS in CAV networks poses unique challenges, including limited computational and memory resources on edge devices, competing demands from critical applications such as navigation and safety systems, and the need to scale across diverse hardware and connectivity conditions. To address these issues, we propose a hybrid server-edge FL framework that offloads pre-training to a central server while enabling lightweight fine-tuning on edge devices. This approach reduces memory usage by up to 42%, decreases training times by up to 75%, and achieves competitive IDS accuracy of up to 99.2%. Scalability analyses further demonstrates minimal performance degradation as the number of clients increase, highlighting the framework's feasibility for CAV networks and other IoT applications.


Adaptive Object Detection for Indoor Navigation Assistance: A Performance Evaluation of Real-Time Algorithms

Pratap, Abhinav, Kumar, Sushant, Chakravarty, Suchinton

arXiv.org Artificial Intelligence

-- This study addresses the critical need for accurate and efficient object detection in assistive technologies for visually impaired individuals. We systematically evaluate the performance of four prominent real-time object detection algorithms--YOLO, SSD, Faster R-CNN, and Mask R-CNN--within the context of indoor navigation assistance. Our analysis, conducted on the Indoor Objects Detection dataset, focuses on key parameters including detection accuracy, processing speed, and adaptability to the unique challenges of indoor environments. This research contributes to a deeper understanding of adaptive machine learning applications that can significantly improve indoor navigation solutions for the visually impaired, promoting inclusivity and accessibility. In today's technology-driven society, there is an increasing emphasis on enhancing accessibility for visually impaired individuals.